ABSTRACT

High-stakes tests are increasingly used to monitor systemic
improvements in mathematics, and teachers are expected to rely on the results
of such tests to adapt their instructional practices. We examine the Texas
Assessment of Academic Skills (TAAS) over a period of three years to determine
to what extent its results can be used to guide instructional decision-making.
We present the results of an expert content analysis of the 10th
grade TAAS mathematics test for 1999, 2000, and 2001, which reveal that
problem solving objectives mask significant content emphases. We further
examine the variation in raw scores by objective across grades and years to
show that this information is not reliable enough to guide changes in instruction.
We examine the sampling of topics within each objective to gauge the distribution
across topics. Finally, we attempt unsuccessfully to account for changes in
difficulty using a combination of changes in sampling, item characteristics,
and composition of distractors. This leads us to question the utility of
providing teachers raw data by objective and points to the urgency of
developing better methods to link content analyses and psychometric methods
of scoring. (Author)

Since 1995, the state of Texas has relied on the Texas Assessment of Academic Skills
(TAAS) to drive the accountability system for Texas schools. Along with attendance
and course credits, passage of the TAAS was required for a student to graduate.
A new assessment, the Texas Assessment of Knowledge and Skills (TAKS), will be
implemented in 2003, and although better aligned with secondary topics, it will
produce similar data artifacts for teacher use. TAAS results, in the form of raw scores by
test objective, scaled scores known as Texas Learning Index (TLI) scores, and item
analysis percentages, have been provided to teachers and administrators to guide them
in improving school performance and providing quality instruction in mathematics.
Our interest was in examining these data to determine whether they could validly be used
for these purposes. That interest was piqued by data showing discrepant results:
student TLI scores increased while raw scores declined, a pattern the state typically
explained as differences in test difficulty (Confrey & Carrejo, 2002). We
sought to understand, at the level of classroom practice, whether the raw scores
for each of the thirteen objectives could be used to make instructional changes, as
is widely believed.

In the case of the TAAS test, we propose that even within a single test and its
analyses, drawing data-driven conclusions reveals how psychometric traditions for
creating scaled scores and equated tests seem to produce results that contrast with
those of content analyses. The key issue of content validity (Heubert & Hauser, 1999), “whether
a test measures what it purports to measure and what conclusions can be drawn from
the results - and whether the conclusions or inferences drawn from the test results are
appropriate” (p. 71), requires one to link these two arenas. Thus our goal is to link
psychometric and content analytical traditions by building protocols to help teachers
refine their analysis and increase their statistical capacity to interpret and critique data
and create plans of action (Confrey & Makar, 2002). In addition, we wish to argue
for the role of outside content experts when conducting analyses of the TAAS. We
argue for more discipline-based protocols for conducting and interpreting item
analyses, and point out that these analyses are likely to be neglected or unpublished under
the current accountability system.

In this paper, we provide an account of our content analysis of the mathematics 
portion of the TAAS test for the 10th grade level over a period of three years (1999,
2000, and 2001). We address, based on available information, the question, “Can 
data provided to teachers in the form of raw scores by objective provide an accurate 
description of student performance and support instructional decision making?” In 
addressing this question, we relied on answers to three sub-questions: 



1. Using an independent content protocol, do the constructed TAAS objectives 
provide an adequate description of the content tested? 

2. How much variation is there in students’ mean performance by these 
objectives across the grades over time? 

3. Can we identify the possible factors that result in changes in performance by 
objective through a content sampling and item analysis? 



As development of the new assessment, TAKS, is underway, examination of the 
TAAS, its construction, format, and content, provides important experimental ground- 
work for future analysis of the new test, designed to be closely aligned with TAAS. 



TAAS Construction and Format 



According to the test makers, the Texas Education Agency (TEA, 2001), con- 
struction of the TAAS first involves a review of state standards for mathematics, the 
Texas Essential Knowledge and Skills (TEKS), to determine appropriate learning 
objectives by grade level. Following this review, educator committees develop drafts 
of test objectives, linked to the TEKS, to be reviewed by teachers and other special- 
ists. The objectives are then refined based on feedback, and sample test items are 
written. Educator committees develop guidelines for assessing each objective which 
include eligible test content, test item formats, and sample items. A test blueprint is 
then developed by item writers, some of whom are identified as former teachers. TEA 
curriculum specialists then review the items. During this process, item review
committees judge the content, difficulty, and bias of each item. Further item revision occurs,
and the items are then field tested. Field-test data are analyzed for reliability, validity, and
possible bias. Data review committees then use statistical analyses to determine if 
items are worthy of being used. The final blueprint is then developed. Field-test items 
are placed in an item bank and the final tests are built from the bank and are designed 
to be equivalent in difficulty from one administration to the next. 

TEA outlines thirteen objectives grouped in three categories: concepts, opera- 
tions, and problem solving. Within concepts there are five objectives: (1) Number 
Concepts, (2) Relations and Functions, (3) Geometric Properties, (4) Measurement, 
and (5) Probability and Statistics. Within operations are (6) Addition, (7) Subtraction, 
(8) Multiplication, and (9) Division. Within problem solving are the four remaining 
objectives: (10) Estimation, (11) Solution Strategies, (12) Representation, and (13) 
Reasonableness. TAAS questions are clustered in groups of four under each objective,
with the exception of Solution Strategies and Representation, which have eight items
each at the exit level. Sixty questions therefore make up the final multiple-choice
exit test, each with four or five answer choices. Furthermore, TEA outlines
which state mathematics standard(s), i.e., TEKS standard(s), are tested by each of the
aforementioned TAAS objectives.
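
The item count implied by this blueprint can be checked with a short sketch. The objective names and item counts come from the description above; the variable and function names are our own illustrative choices, not part of any TEA specification.

```python
# Objectives as listed by TEA, keyed by objective number.
OBJECTIVES = {
    1: "Number Concepts",
    2: "Relations and Functions",
    3: "Geometric Properties",
    4: "Measurement",
    5: "Probability and Statistics",
    6: "Addition",
    7: "Subtraction",
    8: "Multiplication",
    9: "Division",
    10: "Estimation",
    11: "Solution Strategies",
    12: "Representation",
    13: "Reasonableness",
}

# At the exit level, these two objectives carry eight items each;
# every other objective carries four.
EIGHT_ITEM_OBJECTIVES = {11, 12}


def items_per_objective(objective_number):
    """Return the number of items the blueprint assigns to an objective."""
    return 8 if objective_number in EIGHT_ITEM_OBJECTIVES else 4


def total_items():
    """Sum the blueprint item counts over all thirteen objectives."""
    return sum(items_per_objective(number) for number in OBJECTIVES)


print(total_items())  # 11 objectives x 4 items + 2 objectives x 8 items = 60
```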

Examining TAAS Content with an Expert Protocol 

Our approach to the analysis began by: 

1. Creating a protocol based on our selection of topics relevant to K-12 math- 
ematics education. The topics comprising the protocol are indicative of the 
breadth of subject matter teachers and researchers find most relevant to many, 
if not most, implemented curricula. Topics were also chosen based on current 
research in mathematics education. 

2. Obtaining copies of the tests (available from the TEA website) and categoriz- 
ing each item according to the constructed protocol without referring to their 
respective TAAS designation. If a general topic was missing for an item, it was 
identified and the categories were adapted until we could account for all items 
on the test. 

3. Comparing our categorization with TEA's for all three years.

Our protocol includes the following topics (followed by subtopics): a) numera- 
tion (scientific notation, sequences), b) geometry (angle, congruency, coordinate plots, 
formula, spatial reasoning, similarity, symmetry, and vertices, edges, and faces), c) 
measurement (linear/perimeter, area, volume, weight, and the Pythagorean Theorem), 
d) operations (addition, subtraction, multiplication, division), e) rate (ratio and propor- 
tion), f) probability (combination, experiment outcomes, and mean, median, mode), g) 
data and statistics, and h) equations (literal, equation with variable, inequality). For d)
operations, we constructed a table to indicate the specific number types used
in the items we placed under this topic (see Figure 1).

Likewise, for g) data and statistics, we constructed a table to indicate
the representation format involved in the question and the representation involved in
the answer choice (see Figure 2).

Our categorization treated ratio and data analysis as separate categories
(see Figure 3). We used it to examine and display content distribution on the
test over the three years. The following chart shows the number of items per topic. It is
in marked contrast with the test specifications, which specify four items per objective
for most objectives and eight for Solution Strategies and Representation.

Figure 3. Distribution of items by topic, 1999, 2000, 2001.

One can see that the emphasis on certain topics, such as addition and subtraction,
remains relatively unchanged from year to year. However, other topics, such as
multiplication, division, ratio, equations, and data, receive different emphasis from year
to year. (Notably, measurement received a heavier emphasis in 2001 compared to the
previous two years.) The major difference between our categorization and TEA's lies in
our elimination of the Problem Solving objectives (Estimation, Solution Strategies,
Representations, and Reasonableness) and categorization of their items into content
categories. The following table lists the item numbers in each of these Objectives and
the heading under which we placed them on our protocol (see Figure 4). Also note the
variation in topics from year to year.

Figure 4. Content protocol for objectives 10-13 (item numbers and assigned protocol
topics for Objective 10, Estimation; Objective 11, Solution Strategies; Objective 12,
Representations; and Objective 13, Reasonableness, in 1999, 2000, and 2001).

The total number of items in the 1999 test related to multiplicative structures
(multiplication, division, ratio, rate) is eighteen out of the sixty items. The total
number of items in 2000 is eleven out of sixty, and the total number for 2001 is
fifteen out of sixty. We found multiplicative structures far more heavily represented than
teachers recognized, and hence their importance in passing TAAS could easily have been
neglected. Furthermore, TAAS Objective 5, Probability and Statistics, contains four
items, yet on the overall tests, the total number of items related to data, statistics, and
probability under our protocol is six out of sixty for 1999, nine out of sixty for 2000,
and eight out of sixty for 2001. This area is also underrepresented in the test
specifications relative to the actual test. We conclude that constructs in multiplicative structures
and data and statistics (including probability) comprise 24 out of 60 or 40% of the
items for 1999, 20 out of 60 or 33% of the items for 2000, and 23 out of 60 or 38% of
the items for 2001. This analysis indicates how the Problem Solving category, with its
four Objectives, masks content that teachers should emphasize in their instruction. We
understand why the Problem Solving objectives were included but point out that they
should be crossed dimensionally with the Content objectives rather than designed as
their own categories.
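
The combined shares reported above for multiplicative structures and for data, statistics, and probability can be reproduced with a small calculation. The item counts are those stated in the text; the dictionary and function names are illustrative only.

```python
TOTAL_ITEMS = 60

# Item counts under our protocol, as reported in the text.
MULTIPLICATIVE_STRUCTURES = {1999: 18, 2000: 11, 2001: 15}
DATA_STATISTICS_PROBABILITY = {1999: 6, 2000: 9, 2001: 8}


def combined_share(year):
    """Return (item count, percent of test rounded to a whole number) for a year."""
    count = MULTIPLICATIVE_STRUCTURES[year] + DATA_STATISTICS_PROBABILITY[year]
    return count, round(100 * count / TOTAL_ITEMS)


for year in sorted(MULTIPLICATIVE_STRUCTURES):
    count, percent = combined_share(year)
    print(f"{year}: {count} of {TOTAL_ITEMS} items ({percent}%)")
# 1999: 24 of 60 items (40%)
# 2000: 20 of 60 items (33%)
# 2001: 23 of 60 items (38%)
```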

Variation in Students’ Mean Performance by Objective 

Our second investigation concerned the variation in student performance for 
each objective. Teachers were using declines or gains in performance by objective 
as evidence of instructional need or success. We decided to examine the variation in 
students’ mean scores in two ways. First, we examined student performance for TAAS 
assessments 1999, 2000, and 2001 disaggregated by objective. Then we examined 
performance over three years by grade level to see if any trend lines suggested patterns 
of improvement or decline (see Figure 5). 







Figure 5. Variation in mean scores for all grades. 



Note that two of the four Objectives related to operations, namely Objectives 6 and
7, show little variation for all grades. However, Objectives 4, 5, and 11 (Measurement,
Probability and Statistics, and Solution Strategies, respectively) show considerably more
variation. These data raise the question of whether raw scores reported by objective are
stable enough over time to guide instructional decisions.

One possible explanation for the variation is that students are consistently gaining
ground on particular objectives as teachers implement revised strategies for
instruction. To examine this possibility, we looked at student performance on the objectives
individually by grade over three years. Our analyses revealed little consistency in
results. From 1999 to 2000, student performance increased on six Objectives,
while performance decreased on the remaining seven Objectives. From
2000 to 2001, student performance increased on two Objectives and decreased on the
remaining eleven. These data do not suggest that variation in student performance by
Objective was due to improvements over time. Below, we provide the trend charts for
Objectives 4, 5, and 11 for all grades as illustrative of the variation found.

Figure 8. Objective 7 trends for all grades.
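
A tally of this kind is simple to automate. The sketch below assumes a teacher or analyst has mean scores by objective for two consecutive years. The function name is hypothetical, and the numbers in the example are placeholders chosen only to exercise the code. They are not TAAS results.

```python
def count_changes(earlier, later):
    """Count objectives whose mean score rose, fell, or stayed the same.

    Both arguments map objective number -> mean score for one year.
    """
    counts = {"increased": 0, "decreased": 0, "unchanged": 0}
    for objective, old_score in earlier.items():
        new_score = later[objective]
        if new_score > old_score:
            counts["increased"] += 1
        elif new_score < old_score:
            counts["decreased"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Placeholder values for three objectives only; NOT actual TAAS data.
example_earlier = {4: 70.0, 5: 65.0, 11: 72.0}
example_later = {4: 68.0, 5: 69.0, 11: 72.0}

print(count_changes(example_earlier, example_later))
# {'increased': 1, 'decreased': 1, 'unchanged': 1}
```

A count of increases and decreases says nothing about whether a change exceeds ordinary year-to-year fluctuation, which is the concern raised here.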

This analysis demonstrates that teachers cannot simply use changes in their mean 
student performance by objective to guide their planning. This seems obvious as one 
recognizes that the raw scores by objective are the product of not only the content of 
the objective but also the sampling of topics and the difficulty of the items and distrac- 
tors. This led us to our next protocol for analysis. 



Possible Content Factors Affecting Student Performance

Returning to content analytical methods, we tapped into another possible source
of information for teachers, one that could be useful in examining the variation in the
sampling of subtopics within each objective. TEA outlines which TEKS standards are aligned
with each TAAS objective. We created a table outlining the items clustered under each
TAAS Objective. Within each TAAS Objective, we identified which associated TEKS
standard(s) the Objective tested (TEA, 2000), and we aligned each clustered item with
an associated TEKS standard (for example, see Figure 9).
Our conjecture was that too little variability would cause the test to lose validity, as
teachers could virtually drill students on topics likely to be included. We expected that a
valid test over time would sample proportionately across subtopics. We also expected
that if teachers were assuming consistency in certain items, changes in the sampling within
those objectives would lead to drops in performance.

TEKS standards aligned with TAAS Objective 1:
Number Concepts                                      1999      2000      2001

(1)(A) compare and order rational numbers            2, 15     2, 5      8, 20
(1)(C) approximate the value of irrational
       numbers                                       8, 17
(1)(D) express numbers in scientific notation,
       including negative exponents                            10, 13    3
(d)(3)(A) use patterns to generate the laws of
       exponents and applies them                                        16

Figure 9. TEKS to TAAS alignment for objective 1.

Continuing this process for all the TAAS Objectives, we quantified the amount
of variability in item sampling between 1999, 2000, and 2001. For example, since two
items in 2000 are sampled from a new topic while two items remained from the old
topic, we coded this as a 50% change. Eight-item objectives can produce changes such
as 37.5%. Likewise, one item in 2001 differs in alignment from the 2000 test. This
constitutes a one out of four, or 25%, shift in content emphasis. Overall, we calculated
the percent change in content alignment from 1999 to 2000 and then from 2000 to
2001 (see Figure 10).
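
One way to make this coding rule explicit is sketched below, using the Objective 1 alignments from Figure 9. Two points are assumptions rather than something the text states. First, the item numbers are our reading of the figure. Second, the rule counts an item as "changed" when its TEKS standard is not matched by an item on the same standard in the earlier year (a multiset comparison). The rule reproduces the 50% and 25% values given in the text for Objective 1.

```python
from collections import Counter

# Item number -> TEKS standard for Objective 1, read from Figure 9.
# The item numbers are our reading of the figure and should be treated as such.
OBJECTIVE_1 = {
    1999: {2: "(1)(A)", 15: "(1)(A)", 8: "(1)(C)", 17: "(1)(C)"},
    2000: {2: "(1)(A)", 5: "(1)(A)", 10: "(1)(D)", 13: "(1)(D)"},
    2001: {8: "(1)(A)", 20: "(1)(A)", 3: "(1)(D)", 16: "(d)(3)(A)"},
}


def percent_change(earlier, later):
    """Percent of later-year items whose standard is not matched in the earlier year.

    Assumed coding rule: compare the two years as multisets of standards and
    treat the unmatched later-year items as the change in content alignment.
    """
    earlier_counts = Counter(earlier.values())
    later_counts = Counter(later.values())
    # Counter intersection keeps the minimum count per standard.
    retained = sum((earlier_counts & later_counts).values())
    total = sum(later_counts.values())
    return 100 * (total - retained) / total


print(percent_change(OBJECTIVE_1[1999], OBJECTIVE_1[2000]))  # 50.0
print(percent_change(OBJECTIVE_1[2000], OBJECTIVE_1[2001]))  # 25.0
```

Under the same rule, an eight-item objective with three unmatched items yields 3/8, or 37.5%.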

Figure 10. Percent change in content alignment.




Our initial analysis showed that from 1999 to 2000 and from 2000 to 2001, five of
the thirteen objectives were unchanged. An additional four objectives for 1999-2000 and
an additional seven objectives for 2000-2001 showed changes of under fifty percent,
so that only four objectives in the first period and no objectives in the second period were
altered by more than fifty percent. We examined the variation in student performance to
see whether there was a simple relationship between student performance and variation in
sampling (see Figure 11). There was not. This is an area that requires a more
sophisticated form of analysis.

Our final analysis included an attempt to quantify difficulty in three ways: task
consistency, task characteristics, and distractors. Within task consistency, we
considered whether the content being tested was comparable between years. When the
content topic was the same, we noted changes in task characteristics, including number
type, language use, single-step versus multi-step procedures, and the use of a money
context. Distractor analysis was performed at the level of identifying answer choices
that could easily be disregarded and/or that reflected common misconceptions. We
identified a) objectives that appeared to have task consistency, were amenable to test
preparation, and should produce relatively stable performance; b) objectives with varied
difficulty in the items (harder or easier) across years; and c) objectives with varied items
and incommensurability in assessing difficulty. We compared our predictions with the
item analysis provided by TEA of the actual percentage of students passing.

The results of our analyses showed the unreliability of making predictions based
on our measures of item difficulty. Based on these measures, we predicted that the 2000
test would be more difficult on all objectives except Objectives 6 and 7, which we
predicted would remain the same. According to the chart, we were correct for
Objectives 5 and 9. Other results showed we were incorrect, or only minimally correct
for Objectives 10 and 13. Between 2000 and 2001 we had similar results. This is not
surprising, as test equating on TAAS is actually done using an Item Response Theory
approach based in Rasch analysis. The difficulty levels are not publicly released.
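
For readers unfamiliar with Rasch analysis, the sketch below shows the standard textbook form of the Rasch (one-parameter logistic) model. It is offered only as background. It is not TEA's equating procedure, whose details and item difficulty values are not public, and the function name and example values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model.

    Both arguments are on the same logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values for illustration only.
print(rasch_probability(0.0, 0.0))  # 0.5 when ability equals difficulty
print(rasch_probability(1.0, 0.0) > rasch_probability(0.0, 0.0))  # True
```

Because item difficulty in this model is estimated from response data, a content-based judgment of difficulty, such as ours, has no guaranteed correspondence to the difficulty values used in equating.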

Based on our analysis of TAAS for tenth graders, we found it difficult to draw 
any conclusions about student performance and improvement based on objective level 
analysis. These results raise serious doubts about how teachers, after quantifying
such results, are supposed to use this information to make a judgment about perfor- 
mance and thereby influence their instructional decision-making. 

Conclusions 

Our research focused on examining whether the data provided to teachers
for instructional decision-making were valid for this purpose. We worked to extend
the analysis of the data in relation to the published test structure and found
a number of problems in reconciling the results. The problem solving objectives
obscured the underlying content dimensions, making it difficult to judge the relative
needs of students with regard to topics. The variability in student raw scores by
objective makes it unlikely that variations in student performance from year to year represent
real changes in student knowledge. A lack of trend data by objective suggests this
variability is unlikely to represent systematic improvements in instruction. Finally, it
did not appear that one could easily construct a content-valid analysis of changes in
difficulty in terms of topic selection, item characteristics, or distractors.

In future work, we plan to continue to work to develop protocols that can validly 
guide teachers in undertaking content analyses of test results that can inform instruc- 
tional decision-making. We hope to be able to link such analyses with the methodolo- 
gies of test equating to determine whether there is a way to resolve the competing 
influences of psychometric test analysis and score preparation and content analyses. 
We encourage our colleagues in mathematics education to become similarly involved 
in close analysis of content dimensions of testing, to ensure that reform efforts at the 
curricular and instructional level are consistent with the messages given teachers from 
high stakes tests. 